#Libraries
library(knitr)
library(dplyr)
library(plotly)
library(ggcorrplot)
library(ggmosaic)
library(scales)
library(fastDummies)
library(caret)
colors <- c('#0d0887', '#46039f', '#7201a8', '#9c179e', '#bd3786', '#d8576b', '#ed7953', '#fb9f3a', '#fdca26', '#f0f921')
In 1963, the Equal Pay Act was passed, requiring that men and women receive equal pay for equal work. Despite this federal law against gender inequality and discrimination, unjust practices continued in the workplace. Some of the most prominent forms of workplace gender discrimination involve compensation and promotion. While some progress has been made, workplace gender discrimination and inequality are complex and systemic issues that continue to persist.
This analysis utilizes the data from a United States District Court of Houston case that arose under Title VII of the Civil Rights Act of 1964. The plaintiffs in the case were female doctors at Houston College of Medicine. The plaintiffs claimed that the University was engaging in patterns and practices of discrimination against women, most notably in gender pay and promotion discrimination.
The purpose of this analysis is to investigate the dataset used in the gender discrimination case against Houston College of Medicine to analyze and conduct inferences about gender pay and promotion inequalities.
Specifically, this analysis aims to answer the following research questions:
Is the population mean salary greater for male university doctors than female university doctors? Does the inference change when controlling for other factors?
Is there a regression equation to predict the salary of a university doctor based on the variables available? Which factors are most important in predicting salary and is gender a significant predictor?
Are female doctors less likely to be promoted? That is, is the underlying population proportion of full professors among male university doctors greater than the population proportion of full professors among female university doctors?
The dataset used for this analysis was presented by the plaintiffs in the gender discrimination lawsuit and contains relevant information about the background, experience, and salary of 261 doctors.
Source: https://www.kaggle.com/hjmjerry/gender-discrimination?select=Lawsuit.csv
The table below displays the first few rows of the dataset:
data <- read.csv("data.csv")
colnames(data) <- c("ID", "Dept", "Gender", "Emphasis", "Certified", "Publication.Rate", "Experience", "Rank", "Salary.Year1", "Salary.Year2")
kable(head(data))
| ID | Dept | Gender | Emphasis | Certified | Publication.Rate | Experience | Rank | Salary.Year1 | Salary.Year2 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Biology | Male | Research | Not_Certified | 7.4 | 9 | Full_Professor | 77836 | 84612 |
| 2 | Biology | Male | Research | Not_Certified | 6.7 | 10 | Associate | 69994 | 78497 |
| 3 | Biology | Male | Research | Not_Certified | 8.1 | 6 | Assistant | 62872 | 67756 |
| 4 | Biology | Male | Clinical | Board_Certified | 5.1 | 27 | Full_Professor | 155196 | 173220 |
| 5 | Biology | Male | Research | Not_Certified | 7.0 | 10 | Full_Professor | 89268 | 96099 |
| 6 | Biology | Male | Research | Board_Certified | 7.7 | 10 | Full_Professor | 79714 | 87531 |
male <- data[which(data$Gender == "Male"),]
female <- data[which(data$Gender == "Female"),]
The list below describes the variables of interest included in the dataset:
- ID: unique identifier for each doctor
- Dept: department (Biology, Physiology, Genetics, Pediatrics, Medicine, or Surgery)
- Gender: Male or Female
- Emphasis: the doctor's primary emphasis, Clinical or Research
- Certified: Board_Certified or Not_Certified
- Publication.Rate: the doctor's publication rate
- Experience: years of experience
- Rank: professorship level (Assistant, Associate, or Full_Professor)
- Salary.Year1, Salary.Year2: salary in the first and second year recorded
As part of the data preparation process, the first step is to clean the data. The steps below outline the data-cleaning process (checking for duplicate IDs, deciding how to handle categorical variables, flagging outliers, and checking for missing values) to ensure the data are sound when used for analysis.
unique <- length(unique(data$ID)) == nrow(data)
# Decided to not factor categorical data
salary <- data$Salary.Year1
stats <- summary(salary)
iqr <- stats[5] - stats[2]
lower <- stats[2] - (1.5 * iqr)
upper <- stats[5] + (1.5 * iqr)
outliers <- salary[which(salary > upper | salary < lower)]
na.rows <- data[rowSums(is.na(data)) > 0,]
For the majority of the analysis, "Salary Year 1" will be the response variable, and thus, the focus of the analysis.
The histogram below examines the distribution of salary. Notice that the variable has a significant right skew. This attribute is problematic because when the distribution does not follow a normal curve, we cannot make certain assumptions required for various analyses.
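The degree of skew can be quantified directly. The sketch below uses a moment-based skewness statistic; the `skewness` helper is hypothetical (not part of the original analysis), and simulated log-normal salaries stand in for the lawsuit data.

```r
# Sketch: quantify right skew with a moment-based skewness statistic.
# skewness() is a hypothetical helper; simulated log-normal salaries
# stand in for the lawsuit data here.
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
set.seed(1)
sim.salary <- rlnorm(261, meanlog = 11.5, sdlog = 0.4)  # right-skewed, like raw salary
skewness(sim.salary)       # clearly positive: right skew
skewness(log(sim.salary))  # near zero: approximately normal
```

On the actual dataset, `skewness(data$Salary.Year1)` and `skewness(log(data$Salary.Year1))` would give the corresponding before-and-after values.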
plot_ly(data, x = ~Salary.Year1, type = "histogram", marker = list(color = colors[1]), opacity = .7) %>%
layout(title = "Histogram of Salary", xaxis = list(title = "Salary ($)"), yaxis = list(title = "Frequency"))
However, this problem can be addressed with a logarithmic transformation, which makes the salary data approximately normal. The histogram below demonstrates the log transformation of salary. Notice that the variable now follows an approximately normal distribution; thus, the statistical analyses can be performed on the log transformation of salary.
data$Log.Salary <- log(data$Salary.Year1)
plot_ly(data, x = ~Log.Salary, type = "histogram", marker = list(color = colors[2]), opacity = .7) %>%
layout(title = "Histogram of Log Transformation of Salary", xaxis = list(title = "Log Salary ($)"), yaxis = list(title = "Frequency"))
Before we begin our analysis, let's explore the variables included in the dataset to gain a better understanding of the research scenario.
The pie chart below visualizes the percentage of males and females included in the dataset. There is a sufficient number of observations for each gender to conduct an analysis.
plot_ly(labels = c("Male", "Female"), values = c(nrow(male), nrow(female)), type = "pie", marker = list(colors = colors[2:1], line = list(color = "white", width = 2)), opacity = .6, textinfo = "label+percent", showlegend = FALSE) %>%
layout(title = "Percentage of Observations by Gender")
The following density graph illustrates the distribution of salary (without the logarithmic transformation) by gender. Notice that the female distribution is much narrower than the male distribution. Additionally, notice that both the male and female distributions have a right skew, which is reflected in the overall distribution of salary.
density.f <- density(female$Salary.Year1)
density.m <- density(male$Salary.Year1)
plot_ly(x = ~density.f$x, y = ~density.f$y, type = "scatter", mode = "lines", name = "Female", fill = "tozeroy", fillcolor="#0d0887CC", line=list( color="#0d0887CC")) %>%
add_trace(x = ~density.m$x, y = ~density.m$y, name = "Male", fill = "tozeroy", fillcolor="#46039fCC", line=list( color="#46039fCC")) %>%
layout(title = "Density of Salary by Gender", xaxis = list(title = "Salary ($)"), yaxis = list(title = "Density"))
The bar chart below examines the median salary for each department. Notice that departments such as Surgery have a higher median pay than departments such as Biology and Physiology. This is an important factor to consider during the analysis, as some departments tend to pay more than others.
dept <- aggregate(data$Salary.Year1, list(data$Dept), FUN=median)
colnames(dept) = c("Dept", "Median.Salary")
dept$Dept <- factor(dept$Dept, levels = dept$Dept[order(dept$Median.Salary)])
plot_ly(dept, y = ~Dept, x = ~Median.Salary, color = ~Dept, colors = colors[10:1], type = "bar", orientation = "h", opacity = .7) %>%
layout(title = "Median Salary by Department", xaxis = list(title = "Median Salary ($)"), yaxis = list(title = "Department"))
The following mosaic plots examine the frequency of male and female doctors in each category of the variables Rank, Certification, and Emphasis. In the mosaic plots for Certification and Emphasis, there do not appear to be any significant disparities between the percentages of male and female doctors that are board certified versus not certified, or that have a clinical versus research emphasis. However, in the mosaic plot for Professorship Level, notice the low percentage of male assistant professors compared to female assistant professors, and the high percentage of male full professors compared to female full professors.
plt <- ggplot(data = data) +
geom_mosaic(aes(x = product(Gender), fill = Rank)) +
xlab("Gender") +
ylab("Rank") +
ggtitle("Mosaic Plot of Professorship Level by Gender") +
scale_fill_manual(values=colors[3:1])
ggplotly(plt)
plt <- ggplot(data = data) +
geom_mosaic(aes(x = product(Gender), fill = Certified)) +
xlab("Gender") +
ylab("Certified") +
ggtitle("Mosaic Plot of Certification by Gender") +
scale_fill_manual(values=colors[5:4])
ggplotly(plt)
plt <- ggplot(data = data) +
geom_mosaic(aes(x = product(Gender), fill = Emphasis)) +
xlab("Gender") +
ylab("Emphasis") +
ggtitle("Mosaic Plot of Emphasis by Gender") +
scale_fill_manual(values=colors[7:6])
ggplotly(plt)
The boxplots below examine the distribution of salary for the variables Rank, Certification, and Emphasis, grouped by gender. Notice that full professors, board-certified doctors, and those with a clinical emphasis have higher median salaries than their respective counterparts. However, there are distinct distributional differences between females and males across all three variables.
plot_ly(data, x = ~Rank, y = ~Salary.Year1, color = ~Gender, colors = colors[1:2], type = "box", showlegend = TRUE) %>%
layout(boxmode = "group", title = "Distribution of Salary for Professorship Level by Gender", yaxis = list(title = "Salary ($)"))
plot_ly(data, x = ~Certified, y = ~Salary.Year1, color = ~Gender, colors = colors[3:4], type = "box", showlegend = TRUE) %>%
layout(boxmode = "group", title = "Distribution of Salary for Certification by Gender", yaxis = list(title = "Salary ($)"))
plot_ly(data, x = ~Emphasis, y = ~Salary.Year1, color = ~Gender, colors = colors[5:6], type = "box", showlegend = TRUE) %>%
layout(boxmode = "group", title = "Distribution of Salary for Emphasis by Gender", yaxis = list(title = "Salary ($)"))
The scatter plots below visualize the linear association of salary with publication rate and with experience. Notice the moderately strong negative linear relationship between salary and publication rate (r = -0.73), meaning that as publication rate increases, salary tends to decrease. In contrast, there is a weak-to-moderate positive linear relationship between salary and experience (r = 0.30), meaning that salary tends to increase with years of experience, although the correlation coefficient indicates that experience alone is not a strong predictor of salary.
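For reference, the correlation coefficients quoted above are sample Pearson correlations, computed by `cor()` in the code below:
\[r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}\]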
r <- cor(data$Publication.Rate, data$Salary.Year1)
plot_ly(data, x = ~Publication.Rate, y = ~Salary.Year1, type = "scatter", mode = "markers", color = ~Salary.Year1, colors = colors, size = ~Salary.Year1) %>%
layout(title = "Scatter Plot of Publication Rate vs. Salary", xaxis = list(title = "Publication Rate"), yaxis = list(title = "Salary ($)")) %>%
hide_colorbar()
r <- cor(data$Experience, data$Salary.Year1)
plot_ly(data, x = ~Experience, y = ~Salary.Year1, type = "scatter", mode = "markers", color = ~Salary.Year1, colors = colors, size = ~Salary.Year1) %>%
layout(title = "Scatter Plot of Experience vs. Salary", xaxis = list(title = "Experience (Years)"), yaxis = list(title = "Salary ($)")) %>%
hide_colorbar()
We are interested in whether male university doctors have a greater mean salary than female university doctors. To conduct this assessment, we will perform a two-sample test of means to make an inference about the difference in population means between the two groups.
The following conditions must be met in order to make the necessary inferences from the two-sample t-test:
- The two samples are independent of one another.
- The observations within each sample are independent and randomly selected.
- Each population is approximately normally distributed, or the sample sizes are sufficiently large.
The following steps conduct a hypothesis test to make an inference about the difference in population means of the salary of male university doctors and female university doctors.
Let \(\mu_1\) = the population mean of male university doctors’ salaries and \(\mu_2\) = the population mean of female university doctors’ salaries
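In symbols, the hypotheses and the Welch-style test statistic computed in the code below are:
\[H_0: \mu_1 = \mu_2 \quad \text{vs.} \quad H_a: \mu_1 > \mu_2\]
\[t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}\]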
alpha <- 0.05
df <- min(nrow(male)-1, nrow(female)-1)
Determine the appropriate critical value from the standard t-distribution table associated with a right-tail probability of \(\alpha = 0.05\) and \(df = 105\). The appropriate critical value is 1.659.
critical.value <- qt(p=alpha, df=df, lower.tail=FALSE)
t <- (mean(male$Salary.Year1) - mean(female$Salary.Year1)) / (sqrt((var(male$Salary.Year1)/length(male$Salary.Year1)) + (var(female$Salary.Year1)/length(female$Salary.Year1))))
t.result <- t.test(male$Salary.Year1, female$Salary.Year1, alternative = "greater", conf.level=.95)
Reject \(H_0\) since \(6.646 > 1.659\). There is significant evidence at the \(\alpha = 0.05\) level that the population mean salary is greater for male university doctors than for female university doctors.
The confidence interval for the difference in population means is given by the following formula:
\[(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha,\,df}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\]
We are 95% confident that the true mean difference in the salary between male university doctors and female university doctors is between $43,868.76 and $73,066.22.
dif <- mean(male$Salary.Year1) - mean(female$Salary.Year1)
se <- sqrt((var(male$Salary.Year1)/length(male$Salary.Year1)) + (var(female$Salary.Year1)/length(female$Salary.Year1)))
lower <- dif - (critical.value * se)
upper <- dif + (critical.value * se)
From the two-sample test of means above, we concluded that the population mean salary is greater for male university doctors than female university doctors. However, this conclusion carries a degree of bias, as we were not accounting for the doctor's professorship level (rank), emphasis, or certification. As we discovered in the data exploration section, professorship level (Assistant, Associate, or Full Professor), certification (Board Certified or Not Certified), and emphasis (Clinical or Research) all influence the doctor's salary. Note that while department also influences the doctor's salary, there are not enough observations to filter on each department.
We are interested in whether male university doctors have a greater salary than female university doctors when controlling for professorship level, certification, and emphasis. To conduct this assessment, we will perform a two-sample test of means to make an inference about the difference in population means of the two groups (male - female) for each of the 12 categories.
The following table summarizes the results from the hypothesis test for each category. Note that categories with “NA” did not have a sufficient number of observations to complete the test.
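The twelve subgroup comparisons below are written out by hand. As a sketch of a more compact alternative (the `run_subgroup_tests` helper is hypothetical, not part of the original analysis), the same tests can be generated by looping over the Rank × Certified × Emphasis combinations:

```r
# Hypothetical helper (sketch): run the one-sided Welch t-test for every
# Rank x Certified x Emphasis subgroup, recording NA where either gender
# group is too small for the test.
run_subgroup_tests <- function(data, min_n = 2) {
  combos <- expand.grid(Rank = unique(data$Rank),
                        Certified = unique(data$Certified),
                        Emphasis = unique(data$Emphasis),
                        stringsAsFactors = FALSE)
  res <- lapply(seq_len(nrow(combos)), function(i) {
    sub <- data[data$Rank == combos$Rank[i] &
                  data$Certified == combos$Certified[i] &
                  data$Emphasis == combos$Emphasis[i], ]
    m <- sub$Salary.Year1[sub$Gender == "Male"]
    f <- sub$Salary.Year1[sub$Gender == "Female"]
    if (length(m) < min_n || length(f) < min_n) return(c(NA, NA))
    tt <- t.test(m, f, alternative = "greater", conf.level = 0.95)
    c(tt$statistic, tt$p.value)
  })
  cbind(combos,
        t.Stat  = sapply(res, `[`, 1),
        p.Value = sapply(res, `[`, 2))
}
```

On the lawsuit data, `run_subgroup_tests(data)` should produce results equivalent to the twelve hand-coded tests below, up to row ordering and the small-sample threshold chosen.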
# Assistant
#Assistant, Certified, Clinical
m1 <- male[male$Rank == "Assistant" &
male$Certified == "Board_Certified" &
male$Emphasis == "Clinical",]
f1 <- female[female$Rank == "Assistant" &
female$Certified == "Board_Certified" &
female$Emphasis == "Clinical",]
t1 <- t.test(m1$Salary.Year1, f1$Salary.Year1, alternative = "greater", conf.level=.95)
#Assistant, Certified, Research
m2 <- male[male$Rank == "Assistant" &
male$Certified == "Board_Certified" &
male$Emphasis == "Research",]
f2 <- female[female$Rank == "Assistant" &
female$Certified == "Board_Certified" &
female$Emphasis == "Research",]
#t2 <- t.test(m2$Salary.Year1, f2$Salary.Year1, alternative = "greater", conf.level=.95)
t2 <- "na"
#Assistant, Not Certified, Clinical
m3 <- male[male$Rank == "Assistant" &
male$Certified == "Not_Certified" &
male$Emphasis == "Clinical",]
f3 <- female[female$Rank == "Assistant" &
female$Certified == "Not_Certified" &
female$Emphasis == "Clinical",]
t3 <- t.test(m3$Salary.Year1, f3$Salary.Year1, alternative = "greater", conf.level=.95)
#Assistant, Not Certified, Research
m4 <- male[male$Rank == "Assistant" &
male$Certified == "Not_Certified" &
male$Emphasis == "Research",]
f4 <- female[female$Rank == "Assistant" &
female$Certified == "Not_Certified" &
female$Emphasis == "Research",]
t4 <- t.test(m4$Salary.Year1, f4$Salary.Year1, alternative = "greater", conf.level=.95)
# Associate
#Associate, Certified, Clinical
m5 <- male[male$Rank == "Associate" &
male$Certified == "Board_Certified" &
male$Emphasis == "Clinical",]
f5 <- female[female$Rank == "Associate" &
female$Certified == "Board_Certified" &
female$Emphasis == "Clinical",]
t5 <- t.test(m5$Salary.Year1, f5$Salary.Year1, alternative = "greater", conf.level=.95)
#Associate, Certified, Research
m6 <- male[male$Rank == "Associate" &
male$Certified == "Board_Certified" &
male$Emphasis == "Research",]
f6 <- female[female$Rank == "Associate" &
female$Certified == "Board_Certified" &
female$Emphasis == "Research",]
t6 <- t.test(m6$Salary.Year1, f6$Salary.Year1, alternative = "greater", conf.level=.95)
#Associate, Not Certified, Clinical
m7 <- male[male$Rank == "Associate" &
male$Certified == "Not_Certified" &
male$Emphasis == "Clinical",]
f7 <- female[female$Rank == "Associate" &
female$Certified == "Not_Certified" &
female$Emphasis == "Clinical",]
#t7 <- t.test(m7$Salary.Year1, f7$Salary.Year1, alternative = "greater", conf.level=.95)
t7 <- "na"
#Associate, Not Certified, Research
m8 <- male[male$Rank == "Associate" &
male$Certified == "Not_Certified" &
male$Emphasis == "Research",]
f8 <- female[female$Rank == "Associate" &
female$Certified == "Not_Certified" &
female$Emphasis == "Research",]
t8 <- t.test(m8$Salary.Year1, f8$Salary.Year1, alternative = "greater", conf.level=.95)
# Full Professor
#Full Professor, Certified, Clinical
m9 <- male[male$Rank == "Full_Professor" &
male$Certified == "Board_Certified" &
male$Emphasis == "Clinical",]
f9 <- female[female$Rank == "Full_Professor" &
female$Certified == "Board_Certified" &
female$Emphasis == "Clinical",]
t9 <- t.test(m9$Salary.Year1, f9$Salary.Year1, alternative = "greater", conf.level=.95)
#Full Professor, Certified, Research
m10 <- male[male$Rank == "Full_Professor" &
male$Certified == "Board_Certified" &
male$Emphasis == "Research",]
f10 <- female[female$Rank == "Full_Professor" &
female$Certified == "Board_Certified" &
female$Emphasis == "Research",]
t10 <- t.test(m10$Salary.Year1, f10$Salary.Year1, alternative = "greater", conf.level=.95)
#Full Professor, Not Certified, Clinical
m11 <- male[male$Rank == "Full_Professor" &
male$Certified == "Not_Certified" &
male$Emphasis == "Clinical",]
f11 <- female[female$Rank == "Full_Professor" &
female$Certified == "Not_Certified" &
female$Emphasis == "Clinical",]
t11 <- t.test(m11$Salary.Year1, f11$Salary.Year1, alternative = "greater", conf.level=.95)
#Full Professor, Not Certified, Research
m12 <- male[male$Rank == "Full_Professor" &
male$Certified == "Not_Certified" &
male$Emphasis == "Research",]
f12 <- female[female$Rank == "Full_Professor" &
female$Certified == "Not_Certified" &
female$Emphasis == "Research",]
t12 <- t.test(m12$Salary.Year1, f12$Salary.Year1, alternative = "greater", conf.level=.95)
categories <- c("Assistant, Certified, Clinical", "Assistant, Certified, Research","Assistant, Not Certified, Clinical", "Assistant, Not Certified, Research", "Associate, Certified, Clinical", "Associate, Certified, Research","Associate, Not Certified, Clinical", "Associate, Not Certified, Research", "Full Professor, Certified, Clinical", "Full Professor, Certified, Research","Full Professor, Not Certified, Clinical", "Full Professor, Not Certified, Research")
results <- data.frame(Category = categories,
t.Stat = c(t1$statistic, t2, t3$statistic, t4$statistic, t5$statistic, t6$statistic, t7, t8$statistic, t9$statistic, t10$statistic, t11$statistic, t12$statistic),
p.Value = c(t1$p.value, t2, t3$p.value, t4$p.value, t5$p.value, t6$p.value, t7, t8$p.value, t9$p.value, t10$p.value, t11$p.value, t12$p.value))
results$t.Stat <- round(as.numeric(results$t.Stat), digit = 3)
results$p.Value <- round(as.numeric(results$p.Value), digit = 5)
results$Decision <- ifelse(results$p.Value < alpha, "Reject H0", "Fail to Reject H0")
kable(head(results,12))
| Category | t.Stat | p.Value | Decision |
|---|---|---|---|
| Assistant, Certified, Clinical | 2.974 | 0.00213 | Reject H0 |
| Assistant, Certified, Research | NA | NA | NA |
| Assistant, Not Certified, Clinical | 1.745 | 0.05542 | Fail to Reject H0 |
| Assistant, Not Certified, Research | 1.023 | 0.20668 | Fail to Reject H0 |
| Associate, Certified, Clinical | 2.599 | 0.00835 | Reject H0 |
| Associate, Certified, Research | 0.872 | 0.20311 | Fail to Reject H0 |
| Associate, Not Certified, Clinical | NA | NA | NA |
| Associate, Not Certified, Research | 0.584 | 0.28708 | Fail to Reject H0 |
| Full Professor, Certified, Clinical | 1.226 | 0.12253 | Fail to Reject H0 |
| Full Professor, Certified, Research | -0.854 | 0.73049 | Fail to Reject H0 |
| Full Professor, Not Certified, Clinical | 1.278 | 0.20249 | Fail to Reject H0 |
| Full Professor, Not Certified, Research | -0.293 | 0.60895 | Fail to Reject H0 |
Notice that only two categories - 1) Assistant, Certified, Clinical and 2) Associate, Certified, Clinical - had significant evidence at the \(\alpha = 0.05\) level that the population mean salary is greater for male university doctors than for female university doctors. Thus, for the remaining categories, we do not reject the null hypothesis that the population mean salary is the same for male and female university doctors.
For the categories that were statistically significant, the confidence interval for the difference in population means is provided below:
We are 95% confident that the true mean difference in the salary between male and female university doctors that are assistant professors, certified, and in clinical is between $17,941.25 and $65,911.96.
We are 95% confident that the true mean difference in the salary between male and female university doctors that are associate professors, certified, and in clinical is between $19,102.00 and $107,161.10.
# Assistant, Certified, Clinical
df <- min(nrow(m1), nrow(f1))
critical.value <- qt(p=alpha, df=df, lower.tail=FALSE)
dif <- mean(m1$Salary.Year1) - mean(f1$Salary.Year1)
se <- sqrt((var(m1$Salary.Year1)/length(m1$Salary.Year1)) + (var(f1$Salary.Year1)/length(f1$Salary.Year1)))
lower <- dif - (critical.value * se)
upper <- dif + (critical.value * se)
# Associate, Certified, Clinical
df <- min(nrow(m5), nrow(f5))
critical.value <- qt(p=alpha, df=df, lower.tail=FALSE)
dif <- mean(m5$Salary.Year1) - mean(f5$Salary.Year1)
se <- sqrt((var(m5$Salary.Year1)/length(m5$Salary.Year1)) + (var(f5$Salary.Year1)/length(f5$Salary.Year1)))
lower <- dif - (critical.value * se)
upper <- dif + (critical.value * se)
We are interested in modeling a multiple linear regression equation to understand the linear relationship between the response variable, salary, and the explanatory variables included in the dataset. Note: A logarithmic transformation was applied to salary to make its distribution approximately normal.
Before we build the multiple linear regression model, we first split the dataset into a training set and a testing set to perform cross validation on the model. The training set will be used to build the model, and the testing set will validate the model by estimating the prediction error. Note: While only shown once, this process is repeated multiple times in the cross validation section of the analysis.
reg.data <- dummy_cols(data, select_columns = c("Dept", "Gender", "Emphasis", "Certified", "Rank"), remove_selected_columns = TRUE)
reg.data <- reg.data[,c(6, 14, 17, 15, 21, 20, 12, 9, 8, 10, 7, 3, 2)]
smp.size <- floor(.8 * nrow(reg.data))
set.seed(1997) # set seed for visualization purposes
train.index <- sample(seq_len(nrow(reg.data)), size = smp.size)
train <- reg.data[train.index, ]
test <- reg.data[-train.index, ]
plot_ly(labels = c("Train", "Test"), values = c(.8, .2), type = "pie", marker = list(colors = colors[1:2], line = list(color = "white", width = 2)), opacity = .6, textinfo = "label+percent", showlegend = FALSE) %>%
layout(title = "Train/Test Split")
The least-squares regression line for multiple linear regression to predict salary is given by the following equation, where \(\hat{y}\) is the expected or predicted value based on the following variables included in the dataset.
Note: Since several of the variables in the dataset are categorical, dummy variables will be utilized in the regression equation.
\[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1{x_1} + \hat{\beta}_2{x_2} + \hat{\beta}_3{x_3} + \hat{\beta}_4{x_4} + \hat{\beta}_5{x_5} + \hat{\beta}_6{x_6} + \hat{\beta}_7{x_7} + \hat{\beta}_8{x_8} + \hat{\beta}_9{x_9} + \hat{\beta}_{10}{x_{10}} + \hat{\beta}_{11}{x_{11}} + \hat{\beta}_{12}{x_{12}}\]
\[\hat{y} = 10.99 + 0.04{x_1} + 0.18{x_2} + 0.14{x_3} + 0.22{x_4} + 0.14{x_5} + 1.01{x_6} + 0.64{x_7} + 0.32{x_8} + 0.29{x_9} + 0.16{x_{10}} + 0.02{x_{11}} - 0.03{x_{12}}\]
mlr <- lm(Log.Salary~., data = train)
plot_ly(data = train, x = ~Experience, y = resid(mlr), type = "scatter", mode = "markers", marker = list(color = colors[1]), opacity = .6) %>%
layout(title = "Scatter Plot of Experience vs. Residuals",
xaxis = list(title = "Experience (years)"),
yaxis = list(title = "Residuals"))
plot_ly(data = train, x = ~Publication.Rate, y = resid(mlr), type = "scatter", mode = "markers", marker = list(color = colors[2]), opacity = .6) %>%
layout(title = "Scatter Plot of Publication Rate vs. Residuals",
xaxis = list(title = "Publication Rate"),
yaxis = list(title = "Residuals"))
cor_data <- train[,-1]
colnames(cor_data) <- c("Male", "Board Cert", "Clinical", "Full Prof", "Assoc Prof", "Surgery", "Medicine", "Genetics", "Pediatrics", "Biology", "Experience", "Pub Rate")
corr <- cor(cor_data)
corr.plot <- ggcorrplot(corr, type = "lower", tl.cex = 0.2, colors = c(colors[1], "#7201A840", colors[5])) + labs(title = "Correlation Matrix of Explanatory Variables")
ggplotly(corr.plot)
predicted <- predict(mlr)
standard_res <- rstandard(mlr)
homosced <- data.frame(predicted, standard_res)
plot_ly(data = homosced, x =~ predicted, y = ~standard_res, type = "scatter", mode = "markers", marker = list(color = colors[3]), opacity = .6) %>%
layout(title = "Standardized Residuals vs. Predicted Values", xaxis = list(title = "Predicted Values"), yaxis = list(title = "Standardized Residuals"))
plt <- ggplot(train, aes(sample=Log.Salary)) + geom_qq(color = "#9C179E99") + geom_qq_line() + labs(title = "Normal Q-Q plot", x = "Theoretical Quantiles", y = "Sample Quantiles")
ggplotly(plt)
To assess the fit of the multiple linear regression, we will calculate the adjusted R-squared, which measures the proportion of variation in the response variable explained by the regression model, adjusted for the number of predictors in the model.
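Formally, the adjustment penalizes model size, where \(n\) is the number of observations and \(k\) is the number of predictors:
\[R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-k-1}\]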
The adjusted R-squared of the multiple linear regression model above is 0.929. This means that approximately 92.9% of the variability in a university doctor's log salary can be explained by the explanatory variables, after adjusting for the number of predictors.
adjr2 <- summary(mlr)$adj.r.squared
The next step in assessing the fit of the multiple linear regression is to conduct a global F-test to determine whether at least one of the explanatory variables is a significant predictor of salary. The steps below conduct a hypothesis test for the global F-test.
alpha <- 0.05
k <- ncol(train)-1
df1 <- k
df2 <- nrow(train) - k - 1
Determine the appropriate critical value from the F-distribution table associated with a right-tail probability of \(\alpha = 0.05\) and \(df = 12, 195\). The appropriate critical value is 1.96.
critical.value <- qf(p = 0.95, df1 = df1, df2 = df2)
mlr.summary <- summary(mlr)
Reject \(H_0\) since \(225 > 1.96\). We have significant evidence at the \(\alpha = 0.05\) level that at least one of the explanatory variables is a significant predictor of salary. That is, there is evidence of a significant linear association between the salary of the university doctor and the associated explanatory variables.
Since the global F-test concluded that at least one of the explanatory variables is a significant predictor of salary, the next step is to perform testing on each individual parameter to understand the contribution of each independent variable. To do so, we will perform individual t-tests, controlling for the other variables.
The following steps conduct a hypothesis test for each explanatory variable to understand the contribution of the variable in predicting salary, controlling for the other explanatory variables.
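For each coefficient, the test statistic reported below is the estimate divided by its standard error:
\[t_j = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)}, \quad j = 1, \dots, 12\]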
alpha <- 0.05
df <- nrow(train) - k - 1
Determine the appropriate critical value from the standard t-distribution table associated with a right-tail probability of \(\alpha = 0.05\) and \(df = 195\). The appropriate critical value is 1.653.
critical.value <- qt(p=alpha, df=df, lower.tail=FALSE)
The output below shows the results from the individual t-tests for each explanatory variable:
results <- as.data.frame(mlr.summary$coefficients[,3:4])
colnames(results) <- c("t.Stat", "p.Value")
results$decision <- ifelse(results$p.Value < alpha, "Reject H0", "Fail to Reject H0")
kable(results)
|  | t.Stat | p.Value | decision |
|---|---|---|---|
| (Intercept) | 70.357634 | 0.0000000 | Reject H0 |
| Gender_Male | 1.715522 | 0.0878380 | Fail to Reject H0 |
| Certified_Board_Certified | 7.274367 | 0.0000000 | Reject H0 |
| Emphasis_Clinical | 3.031747 | 0.0027618 | Reject H0 |
| Rank_Full_Professor | 7.329002 | 0.0000000 | Reject H0 |
| Rank_Associate | 4.937636 | 0.0000017 | Reject H0 |
| Dept_Surgery | 14.070001 | 0.0000000 | Reject H0 |
| Dept_Medicine | 12.100846 | 0.0000000 | Reject H0 |
| Dept_Genetics | 7.254096 | 0.0000000 | Reject H0 |
| Dept_Pediatrics | 4.732456 | 0.0000043 | Reject H0 |
| Dept_Biology | 4.789135 | 0.0000033 | Reject H0 |
| Experience | 8.781960 | 0.0000000 | Reject H0 |
| Publication.Rate | -1.455615 | 0.1471071 | Fail to Reject H0 |
For the majority of the explanatory variables, the decision is to reject \(H_0\) since the t-statistic is greater than the critical value. This concludes that the explanatory variable is a significant predictor for salary after adjusting for the other explanatory variables.
However, notice that for Gender (male) and publication rate, the decision is to fail to reject \(H_0\) since the t-statistic is less than the critical value. This concludes that the explanatory variable is not a significant predictor for salary after adjusting for the other explanatory variables. As a result, these variables should be considered for removal from the multiple regression model.
Before we develop a revised multiple regression model, it is worth noting that in the full multiple regression model, gender was not a statistically significant predictor of salary when controlling for the other explanatory variables.
In our revised multiple regression model, we only want to include variables that are statistically significant in predicting log salary. One at a time, we remove variables that are insignificant to the model. The revised multiple regression model to predict salary removed gender and publication rate. After the removal of these two variables, all remaining variables were statistically significant in predicting log salary. The equation below represents the revised least-squares regression line for predicting log salary based on the following explanatory variables:
\[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1{x_1} + \hat{\beta}_2{x_2} + \hat{\beta}_3{x_3} + \hat{\beta}_4{x_4} + \hat{\beta}_5{x_5} + \hat{\beta}_6{x_6} + \hat{\beta}_7{x_7} + \hat{\beta}_8{x_8} + \hat{\beta}_9{x_9} + \hat{\beta}_{10}{x_{10}}\] \[\hat{y} = 10.75 + 0.20{x_1} + 0.22{x_2} + 0.26{x_3} + 0.15{x_4} + 1.10{x_5} + 0.69{x_6} + 0.35{x_7} + 0.37{x_8} + 0.18{x_9} + 0.02{x_{10}}\]
mlr2 <- lm(Log.Salary ~ .-Gender_Male-Publication.Rate, data = train)
mlr2.summary <- summary(mlr2)
coef <- mlr2$coefficients
Now that we have developed the revised multiple linear regression model, we can apply it to the testing set to predict the outcomes of new, unseen observations. The output below displays the quality estimates from running cross validation 20 times, along with the average of the results.
The table below displays the average and the first few runs of the cross validation:
R.sq <- c()
RMSE <- c()
MAE <- c()
pred.error.rate <- c()
for(i in 1:20){
smp.size <- floor(.8 * nrow(reg.data))
train.index <- sample(seq_len(nrow(reg.data)), size = smp.size)
train <- reg.data[train.index, ]
test <- reg.data[-train.index, ]
model <- lm(Log.Salary ~ .-Gender_Male-Publication.Rate, data = train)
predictions <- predict(model, test)
R.sq <- c(R.sq, R2(predictions, test$Log.Salary))
RMSE <- c(RMSE, RMSE(predictions, test$Log.Salary))
MAE <- c(MAE, MAE(predictions, test$Log.Salary))
pred.error.rate <- c(pred.error.rate, RMSE(predictions, test$Log.Salary) / mean(test$Log.Salary))
}
quality.est <- data.frame(seq(1,20), round(R.sq,3), round(RMSE,3), round(MAE,3), round(pred.error.rate,3))
colnames(quality.est) <- c("Run", "R-Squared", "RMSE", "MAE", "Prediction Error Rate")
average <- c("Average", round(mean(R.sq), 3), round(mean(RMSE), 3), round(mean(MAE), 3), round(mean(pred.error.rate), 3))
quality.est <- rbind(average, quality.est)
kable(head(quality.est))
| Run | R-Squared | RMSE | MAE | Prediction Error Rate |
|---|---|---|---|---|
| Average | 0.919 | 0.142 | 0.105 | 0.012 |
| 1 | 0.95 | 0.119 | 0.087 | 0.01 |
| 2 | 0.93 | 0.136 | 0.11 | 0.012 |
| 3 | 0.954 | 0.12 | 0.095 | 0.01 |
| 4 | 0.95 | 0.128 | 0.1 | 0.011 |
| 5 | 0.954 | 0.134 | 0.107 | 0.011 |
mean.R.sq <- round(mean(R.sq) * 100, 1)
mean.RMSE <- round(mean(RMSE), 3)
mean.pred.er <- round(mean(pred.error.rate) * 100, 1)
Overall, the results from the testing set indicate that the model is suitable for predicting salary. On average, 91.9% of the variability in a university doctor's salary can be explained by the explanatory variables. Note that the average from the testing set is only slightly worse than the training set (R-squared = 94.3%). On average, the prediction for log salary is off by 0.142 (on the log scale, not in dollars) based on RMSE, resulting in a prediction error rate of 1.2%.
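Because the model predicts log salary, an RMSE of 0.142 is in log-dollars; exponentiating it gives a rough sense of the typical multiplicative error on the dollar scale. A back-of-the-envelope reading, not part of the original output:

```r
# An RMSE of 0.142 in log-dollars corresponds to predictions that are
# typically off by a multiplicative factor of exp(0.142) on the dollar scale.
rmse.log <- 0.142
error.factor <- exp(rmse.log)
round((error.factor - 1) * 100, 1)  # typical percentage error in dollars
```

This works out to predictions that are typically within roughly 15% of the actual salary.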
For the majority of this analysis, the salary disparities between male and female university doctors were the focal point. In this section, we are interested in investigating a pattern of inequality against women in receiving promotions (i.e., to full professor). Specifically, we are interested in whether the underlying population proportion of full professors among male university doctors is greater than the population proportion of full professors among female university doctors.
The table below summarizes the information required to perform a two-sample test for proportions. Note that “success” is defined as being a full professor.
prop.df <- data.frame(
Population = c(1, 2),
Description = c("Male", "Female"),
Sample.Size = c(nrow(male), nrow(female)),
Count.Successes = c(nrow(male[male$Rank == "Full_Professor",]), nrow(female[female$Rank == "Full_Professor",])),
Count.Failures = c(nrow(male[male$Rank != "Full_Professor",]), nrow(female[female$Rank != "Full_Professor",])))
prop.df$Sample.Proportion = round(prop.df$Count.Successes/prop.df$Sample.Size,4)
kable(prop.df)
| Population | Description | Sample.Size | Count.Successes | Count.Failures | Sample.Proportion |
|---|---|---|---|---|---|
| 1 | Male | 155 | 69 | 86 | 0.4452 |
| 2 | Female | 106 | 16 | 90 | 0.1509 |
The following steps conduct a hypothesis test to make an inference about the underlying population proportion of full professors among male university doctors versus female university doctors.
Let \(p_1\) = the population proportion of male university doctors who are full professors and \(p_2\) = the population proportion of female university doctors who are full professors. The hypotheses are \(H_0: p_1 = p_2\) versus \(H_a: p_1 > p_2\).
alpha <- 0.05
Determine the appropriate critical value from the standard normal distribution associated with a right-hand tail probability of \(\alpha = 0.05\). The appropriate critical value is 1.64.
critical.value <- qnorm(1 - alpha)
# Pooled proportion of successes across both samples
p <- sum(prop.df$Count.Successes) / sum(prop.df$Sample.Size)
p1 <- prop.df$Sample.Proportion[1]
p2 <- prop.df$Sample.Proportion[2]
s1 <- prop.df$Count.Successes[1]
s2 <- prop.df$Count.Successes[2]
n1 <- prop.df$Sample.Size[1]
n2 <- prop.df$Sample.Size[2]
z <- (p1 - p2) / (sqrt( p*(1-p) * ((1/n1)+(1/n2)) ))
prop.test <- prop.test(c(s1, s2), c(n1, n2), conf.level = 1 - alpha, alternative = "greater")
Reject \(H_0\) since \(4.98 > 1.64\). There is significant evidence at the \(\alpha = 0.05\) level that the proportion of full professors among male university doctors is greater than the proportion among female university doctors. The percentage of full professors was approximately 29 percentage points lower for female doctors than for male doctors.
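As a sanity check (not part of the original analysis), the hand-computed z-statistic can be verified against R's built-in `prop.test` using the counts from the table above; without Yates' continuity correction, the chi-squared statistic from `prop.test` on a 2x2 table equals \(z^2\):

```r
# Cross-check the hand-computed z-statistic with prop.test.
s <- c(69, 16); n <- c(155, 106)       # full-professor counts and sample sizes
p.pool <- sum(s) / sum(n)              # pooled proportion under H0
z <- (s[1]/n[1] - s[2]/n[2]) /
  sqrt(p.pool * (1 - p.pool) * (1/n[1] + 1/n[2]))
check <- prop.test(s, n, alternative = "greater", correct = FALSE)
round(z, 2)                            # ~4.98
round(sqrt(unname(check$statistic)), 2)  # matches |z|
```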
The confidence interval for the difference in population proportions is given by the following formula:
\[(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2}\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}\]
We are 95% confident that the percentage of university doctors that are full professor is between 19.1% and 39.8% lower among female doctors than male doctors.
dif <- p1 - p2
se <- sqrt( (p1*(1-p1)/n1) + (p2*(1-p2)/n2) )
ci.critical.value <- qnorm(1 - alpha/2)  # 1.96 for a two-sided 95% interval
lower <- dif - (ci.critical.value * se)
upper <- dif + (ci.critical.value * se)
The goal of this project was to analyze the dataset used in a workplace gender discrimination case and utilize various statistical techniques to conduct inferences about gender pay and promotion inequalities. Throughout the analysis, we investigated questions to understand the population mean salary differences between male and female university doctors, to predict a university doctor's salary and determine which variables are its most significant predictors, and to examine the proportion of full professors among male versus female doctors.
Overall, there is no disputing that male university doctors are paid significantly more than female university doctors. However, this may or may not hold when controlling for variables such as professorship level, certification, and emphasis. When doing so, there was significant statistical evidence that male university doctors are paid more than female university doctors only for assistant and associate professors who are board certified and in the clinical emphasis. Additionally, from the multiple linear regression we found that gender is not a significant predictor of a university doctor's salary when the other variables in the dataset are included; factors such as department and professorship level play a larger role in predicting salary. This does not discount the notable inequality between the proportion of male full professors and female full professors. As a result, if a lower proportion of female university doctors are promoted to full professor, then female university doctors are likely to have lower salaries than male university doctors.